Journal of Statistical Software
work with it, then tidy tools will be inextricably linked to tidy data. This makes it easy to
get stuck in a local maximum where independently changing data structures or data tools
will not improve workflow. Breaking out of this local maximum is hard. It requires long-term
concerted effort with the prospect of many false starts. While I hope that the tidy data
framework is not one of those false starts, I also do not see it as the final solution. I hope
others will build on this framework to develop even better data storage strategies and better
tools.
Surprisingly, I have found few principles to guide the design of tidy data that acknowledge
both statistical and cognitive factors. To date, my work has been driven by my experience
doing data analysis, my knowledge of relational database design, and my own rumination on
the tools of data analysis. The human factors, user-centered design, and human-computer
interaction communities may be able to add to this conversation, but the design of data and
tools to work with it has not been an active research topic in those fields. In the future, I
hope to use methodologies from these fields (user-testing, ethnography, talk-aloud protocols)
to improve our understanding of the cognitive side of data analysis, and to further improve
our ability to design appropriate tools.
Other formulations of tidy data are possible. For example, it would be possible to construct
a set of tools for dealing with values stored in multidimensional arrays. This is a common
storage format for large biomedical datasets generated by microarrays or fMRIs. It is also
necessary for many multivariate methods based on matrix manipulation. Fortunately, many
efficient tools exist for working with high-dimensional arrays, even sparse ones, so such an
array-tidy format is likely to be not only compact and efficient but also easy to connect
to the mathematical basis of statistics. This, in fact, is the
approach taken by the pandas Python data analysis library (McKinney 2010). Even more
interestingly, we could consider tidy tools that can ignore the underlying data representation
and automatically choose between array-tidy and dataframe-tidy formats to optimize memory
usage and performance.
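The distinction between the two representations can be illustrated with a minimal sketch in Python, using numpy and pandas as the text suggests. The gene-expression framing and all variable names here are hypothetical, chosen only to show that the array-tidy and dataframe-tidy layouts carry the same information and can be converted between freely.

```python
import numpy as np
import pandas as pd

# Hypothetical data: expression values for 3 genes measured in 4 samples.
genes = ["g1", "g2", "g3"]
samples = ["s1", "s2", "s3", "s4"]

# Array-tidy: one dense array per measured variable, with axis labels kept
# separately. Compact, and directly usable by matrix-based methods.
expr = np.arange(12, dtype=float).reshape(3, 4)  # rows = genes, cols = samples

# Dataframe-tidy: one row per (gene, sample) combination, one column per
# variable. More verbose, but uniform for general-purpose tidy tools.
tidy = (
    pd.DataFrame(expr, index=genes, columns=samples)
    .stack()
    .rename("expression")
    .rename_axis(["gene", "sample"])
    .reset_index()
)

# Round trip back to the array layout: the two forms are equivalent.
back = tidy.pivot(index="gene", columns="sample", values="expression")
assert np.allclose(back.to_numpy(), expr)
```

A tool that chose representations automatically, as envisaged above, would pick the dense array when the data form a complete crossing of the index variables and fall back to the long dataframe otherwise.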
Apart from tidying, there are many other tasks involved in cleaning data: parsing dates
and numbers, identifying missing values, correcting character encodings (for international
data), matching similar but not identical values (created by typos), verifying experimental
design, and filling in structural missing values, not to mention model-based data cleaning that
identifies suspicious values. Can we develop other frameworks to make these tasks easier?
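Two of the cleaning tasks listed above, parsing dates and identifying missing values, can be sketched briefly in Python with pandas. The column names, the invalid date, and the use of -99 as an ad-hoc missing-value code are all hypothetical, invented only to illustrate the kind of work such a framework would need to support.

```python
import pandas as pd

# Hypothetical messy input: one invalid date and a -99 missing-value code.
raw = pd.DataFrame({
    "date": ["2011-01-05", "2011-01-32", "2011-02-14"],  # second date is invalid
    "weight": ["72.5", "-99", "68.0"],                    # -99 marks "missing"
})

# Parse dates, coercing unparseable entries to NaT (a true missing value).
dates = pd.to_datetime(raw["date"], errors="coerce")

# Parse numbers, then recode the -99 sentinel as missing.
weights = pd.to_numeric(raw["weight"])
weights = weights.mask(weights == -99)

clean = raw.assign(date=dates, weight=weights)
```

Each step turns an informal convention (a text date, a sentinel code) into an explicit, typed representation that downstream tidy tools can rely on.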
Acknowledgments
This work would not be possible without the many conversations I have had about data and
how to deal with them statistically. I would particularly like to thank Phil Dixon, Di Cook,
and Heike Hofmann, who have put up with numerous questions over the years. I would also
like to thank the users of the reshape package who have provided many challenging problems,
and my students who continue to challenge me to explain what I know in a way that they
can understand. I would also like to thank Bob Muenchen, Burt Gunter, Nick Horton and
Garrett Grolemund, who gave detailed comments on earlier drafts. I particularly thank
Ross Gayler, who provided the nice example of the challenges of defining a variable, and Ben
Bolker who showed me the natural equivalence between a paired t test and a mixed effects
model.